Bootstrapping Unsupervised Bilingual Lexicon Induction
Authors
Abstract
The task of unsupervised lexicon induction is to find translation pairs across monolingual corpora. We develop a novel method that creates seed lexicons by identifying cognates in the vocabularies of related languages on the basis of their frequency and lexical similarity. We apply bidirectional bootstrapping to a method which learns a linear mapping between context-based vector spaces. Experimental results on three language pairs show consistent improvement over prior work.
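The cognate-based seeding step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the scoring function or any thresholds, so everything specific here is an assumption: normalized Levenshtein distance stands in for lexical similarity, a gap in frequency rank stands in for the frequency criterion, and the names `edit_distance`, `lexical_similarity`, `seed_lexicon`, `min_sim`, and `max_rank_gap` are hypothetical. The bootstrapping and linear-mapping stages are not covered.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def lexical_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]: 1 minus the length-normalized edit distance."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def seed_lexicon(src_vocab, tgt_vocab, min_sim=0.7, max_rank_gap=2):
    """Pair each source word with its most similar target word.

    Both vocabularies are assumed to be lists sorted by descending corpus
    frequency, so a list index is a frequency rank. A pair is kept only if
    the ranks differ by at most `max_rank_gap` and the similarity reaches
    `min_sim` (both thresholds are illustrative assumptions).
    """
    seeds = []
    for i, src_word in enumerate(src_vocab):
        best = None
        for j, tgt_word in enumerate(tgt_vocab):
            if abs(i - j) > max_rank_gap:
                continue
            sim = lexical_similarity(src_word, tgt_word)
            if sim >= min_sim and (best is None or sim > best[1]):
                best = (tgt_word, sim)
        if best is not None:
            seeds.append((src_word, best[0]))
    return seeds


if __name__ == "__main__":
    # Toy Spanish and Portuguese vocabularies, ordered by made-up frequency.
    spanish = ["de", "tiempo", "casa", "noche", "perro"]
    portuguese = ["de", "casa", "tempo", "noite", "cachorro"]
    print(seed_lexicon(spanish, portuguese))
```

In a real setting the rank window and similarity threshold would need tuning per language pair, and the nested loop would need pruning for large vocabularies.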
Similar Resources
Cross-Lingual Bootstrapping of Semantic Lexicons: The Case of FrameNet
This paper considers the problem of unsupervised semantic lexicon acquisition. We introduce a fully automatic approach which exploits parallel corpora, relies on shallow text properties, and is relatively inexpensive. Given the English FrameNet lexicon, our method exploits word alignments to generate frame candidate lists for new languages, which are subsequently pruned automatically using a sm...
Adversarial Training for Unsupervised Bilingual Lexicon Induction
Word embeddings are well known to capture linguistic regularities of the language on which they are trained. Researchers also observe that these regularities can transfer across languages. However, previous endeavors to connect separate monolingual word embeddings typically require cross-lingual signals as supervision, either in the form of parallel corpus or seed lexicon. In this work, we show...
Earth Mover's Distance Minimization for Unsupervised Bilingual Lexicon Induction
Cross-lingual natural language processing hinges on the premise that there exists invariance across languages. At the word level, researchers have identified such invariance in the word embedding semantic spaces of different languages. However, in order to connect the separate spaces, cross-lingual supervision encoded in parallel data is typically required. In this paper, we attempt to establis...
Lexical and Grammatical Inference
Children are facile at both discovering word boundaries and using those words to build higher-level structures in tandem. Current research treats lexical acquisition and grammar induction as two distinct tasks; doing so has led to unreasonable assumptions. State-of-the-art unsupervised results presuppose a perfectly segmented, noise-free lexicon, while largely ignoring how the lexicon is used. T...
Supervised Bilingual Lexicon Induction with Multiple Monolingual Signals
Prior research into learning translations from source and target language monolingual texts has treated the task as an unsupervised learning problem. Although many techniques take advantage of a seed bilingual lexicon, this work is the first to use that data for supervised learning to combine a diverse set of signals derived from a pair of monolingual corpora into a single discriminative model....